BioHackathon series in 2011 and 2012: penetration of ontology and linked data in life science domains
The application of semantic technologies to the integration of biological data and the interoperability of bioinformatics analysis and visualization tools has been the common theme of a series of annual BioHackathons hosted in Japan for the past five years. Here we provide a review of the activities and outcomes from the BioHackathons held in 2011 in Kyoto and 2012 in Toyama. To implement semantic technologies in the life sciences efficiently, participants formed various sub-groups and worked on the following topics: Resource Description Framework (RDF) models for specific domains, text mining of the literature, ontology development, essential metadata for biological databases, platforms to enable efficient Semantic Web technology development and interoperability, and the development of applications for Semantic Web data. In this review, we briefly introduce the themes covered by these sub-groups. The observations made, conclusions drawn, and software development projects that emerged from these activities are discussed.
Software and database resource mentions across the whole of PubMed Central full-text articles
<p>This is a compressed .sql.gz file of a MySQL database dump. The table contains mentions of database and software resource names, automatically extracted by bioNerDS from the full open-access subset of full-text PubMed Central articles.</p>
<p>Each matched resource is identified by name, text offsets and "normalised" name, and also includes details of the rules from which the name was matched.</p>
<p>This dataset is one of the primary research contributions of my PhD work and of a paper currently being finalised for submission to PLoS Computational Biology.</p>
PubMed Central literature composition and analysis
<p>Compressed MySQL data dump of literature composition analyses including total token, sentence and syllable counts, Flesch readability scores, nouns, verbs and adjectives for the complete full-text open-access subset of PubMed Central.</p>
<p>This is one of the research contributions of my PhD.</p>
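As an illustration of how the token, sentence and syllable counts relate to a readability score, here is a minimal sketch of the standard Flesch Reading Ease formula. It assumes the dataset's scores are of this Reading Ease type and that token counts stand in for word counts; neither is confirmed by the description above. The function name and arguments are hypothetical and do not reflect the dump's actual column names.

```python
def flesch_reading_ease(total_words: int, total_sentences: int, total_syllables: int) -> float:
    """Standard Flesch Reading Ease score computed from aggregate counts.

    Higher scores indicate text that is easier to read.
    """
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


if __name__ == "__main__":
    # Toy counts for illustration only, not taken from the dataset.
    print(round(flesch_reading_ease(100, 5, 150), 3))
```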
Extracting patterns of database and software usage from the bioinformatics literature
Motivation: As a natural consequence of being a computer-based discipline, bioinformatics has a strong focus on database and software development, but the volume and variety of resources are growing at unprecedented rates. An audit of database and software usage patterns could help provide an overview of developments in bioinformatics and community common practice, and comparing the links between resources through time could demonstrate both the persistence of existing software and the emergence of new tools.
Results: We study the connections between bioinformatics resources and construct networks of database and software usage patterns, based on resource co-occurrence, that correspond to snapshots of common practice in the bioinformatics community. We apply our approach to pairings of phylogenetics software reported in the literature and argue that these could provide a stepping stone into the identification of scientific best practice.
Availability and implementation: The extracted resource data, the scripts used for network generation and the resulting networks are available at http://bionerds.sourceforge.net/networks/
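The co-occurrence idea in the Results can be sketched roughly as follows. This is a minimal illustration, not the authors' network-generation scripts. It assumes that two resources are linked whenever they are mentioned in the same article and that the edge weight is the number of articles in which the pair co-occurs. The function name, article identifiers and resource names are all hypothetical placeholders.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(doc_resources: dict[str, list[str]]) -> Counter:
    """Count, for each unordered pair of resources, the documents mentioning both."""
    edges: Counter = Counter()
    for resources in doc_resources.values():
        # De-duplicate within a document so repeated mentions count once,
        # and sort so each pair has a single canonical (a, b) ordering.
        unique = sorted(set(resources))
        for a, b in combinations(unique, 2):
            edges[(a, b)] += 1
    return edges


if __name__ == "__main__":
    # Toy input: placeholder article IDs mapped to placeholder resource names.
    toy = {
        "doc1": ["ToolA", "ToolB", "ToolA"],
        "doc2": ["ToolB", "ToolA", "DatabaseC"],
    }
    for (a, b), weight in sorted(cooccurrence_edges(toy).items()):
        print(a, b, weight)
```

Building one such edge list per publication year would give the kind of per-period snapshots the abstract describes, which could then be compared over time.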
Ambiguity and variability of database and software names in bioinformatics
Background:
There are numerous options available to achieve various tasks in bioinformatics, but until recently, there were no tools that could systematically identify mentions of databases and tools within the literature. In this paper we explore the variability and ambiguity of database and software name mentions and compare dictionary and machine learning approaches to their identification.
Results:
Through the development and analysis of a corpus of 60 full-text documents manually annotated at the mention level, we report high variability and ambiguity in database and software mentions. On a test set of 25 full-text documents, a baseline dictionary look-up achieved an F-score of 46 %, highlighting not only variability and ambiguity but also the extensive number of new resources introduced. A machine learning approach achieved an F-score of 63 % (with precision of 74 %) and 70 % (with precision of 83 %) for strict and lenient matching respectively. We characterise the issues with various mention types and propose potential ways of capturing additional database and software mentions in the literature.
Conclusions:
Our analyses show that identification of mentions of databases and tools is a challenging task that cannot be achieved by relying on current manually curated resource repositories. Although machine learning shows improvement and promise (primarily in precision), more contextual information needs to be taken into account to achieve a good degree of accuracy.
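The Results report F-scores and precision but not recall. Assuming the F-score is the balanced F1 measure (the harmonic mean of precision and recall), which the abstract does not state, recall can be recovered as R = F·P / (2P − F). The sketch below applies that rearrangement to the reported figures. Because the inputs are rounded percentages, the outputs are approximate, and the function name is a hypothetical one chosen for illustration.

```python
def recall_from_f1(f1: float, precision: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, given F1 and P as fractions in (0, 1]."""
    denominator = 2 * precision - f1
    if denominator <= 0:
        raise ValueError("no valid recall: F1 must be below twice the precision")
    return f1 * precision / denominator


if __name__ == "__main__":
    # Figures as reported in the abstract (rounded percentages).
    print("strict :", round(recall_from_f1(0.63, 0.74), 3))
    print("lenient:", round(recall_from_f1(0.70, 0.83), 3))
```

Under the F1 assumption this gives a recall of roughly 55 % for strict matching and 61 % for lenient matching, consistent with the conclusion that the improvement is primarily in precision.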